Use admission_data.csv for this exercise.
# Load and view first few lines of dataset
import pandas as pd
import numpy as np
df = pd.read_csv('admission_data.csv')
df.head()
# Proportion of students that are female
len(df[df['gender'] == 'female'])/df.shape[0]
# Proportion of students that are male
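# computed directly rather than as `1 - _`, which depends on the previous cell's output
(df['gender'] == 'male').mean()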
# Admission rate for females
df[df['gender'] == 'female']['admitted'].mean()
# Admission rate for males
df[df['gender'] == 'male']['admitted'].mean() #admission rates for females appear to be lower
# What proportion of female students are majoring in physics?
# Given that a student is female, what is the probability that they major in physics?
# That is the proportion of students who are both female and physics majors,
# divided by the proportion of students who are female.
# Both proportions share the same denominator (total students), so the ratio of counts is enough.
len(df.query('gender == "female" and major == "Physics"'))/len(df[df['gender'] == 'female'])
# What proportion of male students are majoring in physics?
len(df.query('gender == "male" and major == "Physics"'))/len(df[df['gender'] == 'male']) # a much larger share of males apply to physics
# Admission rate for female physics majors
# That is what proportion of females who apply in physics are admitted
fem_adm_phys = df.query('gender == "female" and major == "Physics" and admitted == True').shape[0]
fem_phys = df.query('gender == "female" and major == "Physics"').shape[0]
fem_adm_phys/fem_phys
# Admission rate for male physics majors
# That is what proportion of males who apply in physics are admitted
male_adm_phys = df.query('gender == "male" and major == "Physics" and admitted == True').shape[0]
male_phys = df.query('gender == "male" and major == "Physics"').shape[0]
male_adm_phys/male_phys # the female admission rate in physics is higher
# What proportion of female students are majoring in chemistry?
len(df.query('gender == "female" and major == "Chemistry"'))/len(df[df['gender'] == 'female'])
# What proportion of male students are majoring in chemistry?
len(df.query('gender == "male" and major == "Chemistry"'))/len(df[df['gender'] == 'male']) # a much smaller share of males apply to chemistry
# Admission rate for female chemistry majors
fem_adm_chem = df.query('gender == "female" and major == "Chemistry" and admitted == True').shape[0]
fem_chem = df.query('gender == "female" and major == "Chemistry"').shape[0]
fem_adm_chem/fem_chem
# Admission rate for male chemistry majors
male_adm_chem = df.query('gender == "male" and major == "Chemistry" and admitted == True').shape[0]
male_chem = df.query('gender == "male" and major == "Chemistry"').shape[0]
male_adm_chem/male_chem # males have a lower admission rate in chemistry as well as in physics
# Admission rate for physics majors
df[df['major'] == "Physics"]['admitted'].mean()
# Admission rate for chemistry majors
df[df['major'] == "Chemistry"]['admitted'].mean()
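The per-group counts and admission rates computed one at a time above can also be collected in a single table with groupby. The sketch below is illustrative only: so that it runs on its own, it builds a small made-up DataFrame (`toy`, with hypothetical values) that reuses the column names of admission_data.csv. On the real data, `df` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data standing in for admission_data.csv.
# The column names match the exercise; the values are invented for illustration.
toy = pd.DataFrame({
    'gender': ['female'] * 10 + ['male'] * 10,
    'major': ['Physics'] * 2 + ['Chemistry'] * 8 + ['Physics'] * 8 + ['Chemistry'] * 2,
    'admitted': [True] * 2 + [True] * 3 + [False] * 5 + [True] * 6 + [False] * 2 + [False] * 2,
})

# One row per (gender, major): number of applicants and admission rate.
summary = toy.groupby(['gender', 'major'])['admitted'].agg(
    applicants='size',
    admission_rate='mean',
)

# Share of each gender's applicants that applied to each major.
summary['share_of_gender'] = (
    summary['applicants']
    / summary.groupby(level='gender')['applicants'].transform('sum')
)

print(summary)
```

Reading the admission rate and the share of applicants side by side makes it easier to see which majors each group concentrates in and how selective those majors are.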
A much larger share of female applicants applied to chemistry, which had the lower admission rate of the two majors, so females had a lower overall admission rate. Within each major, however, females had a higher admission rate than males. This reversal between the aggregated comparison and the per-major comparison is known as Simpson's paradox.
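The reversal is possible because each group's overall admission rate is a weighted average of its per-major rates, where the weights are the shares of that group applying to each major. The following is a minimal sketch of that arithmetic using invented counts (the `counts` dictionary and the `overall_rate` helper are hypothetical and not taken from admission_data.csv):

```python
from fractions import Fraction

# Hypothetical (applicants, admitted) counts per gender and major.
# These numbers are made up for illustration; they are not from admission_data.csv.
counts = {
    'female': {'Physics': (2, 2), 'Chemistry': (8, 3)},
    'male': {'Physics': (8, 6), 'Chemistry': (2, 0)},
}

def overall_rate(by_major):
    """Overall rate = sum over majors of (share of applicants) * (per-major rate)."""
    total = sum(n for n, _ in by_major.values())
    return sum(Fraction(n, total) * Fraction(k, n) for n, k in by_major.values())

for gender, by_major in counts.items():
    per_major = {m: float(Fraction(k, n)) for m, (n, k) in by_major.items()}
    print(gender, per_major, 'overall:', float(overall_rate(by_major)))
```

In these made-up numbers, females have the higher rate within each major, yet most of their applications go to the more selective major, which pulls their weighted average below that of males.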